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Abstract 


State-of-the-art image-set matching techniques typically implicitly model each image-set with 
a Gaussian distribution. Here, we propose to go beyond these representations and model image- 
sets as probability distribution functions (PDFs) using kernel density estimators. To compare and 
match image-sets, we exploit Csiszar /-divergences, which bear strong connections to the geodesic 
distance defined on the space of PDFs, i.e., the statistical manifold. Furthermore, we introduce 
valid positive definite kernels on the statistical manifolds, which let us make use of more powerful 
classification schemes to match image-sets. Finally, we introduce a supervised dimensionality 
reduction technique that learns a latent space where /-divergences reflect the class labels of the data. 

Our experiments on diverse problems, such as video-based face recognition and dynamic texture 
classification, evidence the benefits of our approach over the state-of-the-art image-set matching 
methods. 

1. Statistical Manifolds and /-Divergences 

In this paper, we rely on probability density functions (PDFs) and on the distances between them to 
analyze image-sets. In this section, we therefore review some concepts related to the geometry of 
the space of PDFs. 

Let A be a set. A (PDF) on A is a function p : A —)• M+ such that f^p(x)dx = 1. Let A4 
be a family of PDFs on the set A. With certain assumptions (e.g., differentiability), Ai forms 
a Riemannian structure, i.e., a differentiable manifold equipped with a Riemannian metric. The 
Riemannian metric enables us to measure the length of curves/ 

The Fisher-Rao metric (Radhakrishna Rao, 1945) is undoubtedly the most common Riemannian 
metric to analyze Ai. Unfortunately, an analytic form of geodesic distance induced by the Fisher- 
Rao metric can only be obtained for specific distributions, such as Gaussians, or the exponential 


1. On a Riemannian manifold, the geodesic distance between two points corresponds to the length of the shortest path 
on the manifold between them. 
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family (Radhakrishna Rao, 1945)^. In other words, estimating geodesic distances between general 
distributions, such as the ones we use here, is impractical. To address this issue, here, we propose 
to compare PDFs with two /-divergences, which, as discussed later, have strong connections with 
the geodesic distance. 

Formally, a Csiszar /-divergence is a function of two probability distributions that measures 
their similarity, and is defined as 

Sfiph) = J f 

where / is a convex function on (0,cx)) with /(I) = 0. Different choices of / yield different 
divergences. Below, and in the rest of this paper, we focus on two special cases, which induce the 
Hellinger distance and the Jeffrey divergence, respectively. 

The Hellinger distance can be obtained by choosing f{t) = (y/ — 1)^ in Eq. 1, and is defined 
below. 

Definition 1 The Hellinger distance between two probability distributions p and q is defined as 

^Hip\\Q) = j (y/p(®) - dx . (2) 

If, instead, we set f{t) = tln{t) — ln(f) in Eq. 1, we obtain the Jeffrey divergence defined 
below. 

Definition 2 The Jejfrey or symmetric KL divergence between two probability distributions p and q 
is defined as 

dj{p\\q) = j (p{x) - q{x)^ In ^^^dx . (3) 

Erom a geometrical point of view, the Riemannian structure induced by the Hellinger distance 
is different from the one induced by the J-divergence. However, these two divergences share the 
property that their respective Riemannian metrics can be obtained from the Eisher-Rao metric (see 
Thereom 5 in (Amari and Cichocki, 2010)). Eurthermore, in (Baktashmotlagh et ah, 2014), it 
was shown that the length of any given curve is the same under the Hellinger distance and un¬ 
der the Eisher-Rao metric up to scale. These two properties therefore relate these divergences to the 
geodesic distance on the statistical manifold, and thus make them an attractive alternative to com¬ 
pare PDEs. Another important property of these two divergences is given by the following theorem. 


Theorem 3 The Hellinger distance and the Jejfrey divergence between two distributions are invari¬ 
ant under differentiable and invertible (diffeomorphism) transformations. 

That is, given two distributions pi{x) and P 2 {x) in space X;x G X,\et h : X —3^ be a dif¬ 
ferentiable and invertible function that converts x into y. Under function h, we have pi{x)dx = 
qi{y)dy; i G {1, 2} and d{y) = {^{xjdx where \ J{x)\ denotes the determinant of the Jacobian 
matrix h. The above invariance property states that 

5f{pi{x),p2{x)) = Sf{qi{y),q2{y)). 

2. Not to be confused with exponential distributions, or distributions generally expressed as sums of exponentials. 
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The proof of this theorem can be found in several recent studies {e.g., see Theorem. 1 in (Qiao 
and Minematsu, 2010)). It has also been known to mathematicians for decades (Ali and Silvey, 
1966). Invariance to diffeomorphism seems an attractive property in the context of computer vision, 
and in particular for image-set matching, since images in a set can typically be subject to many 
variations, such as changes of illumination or of environment/capture conditions. Furthermore, we 
will exploit this property when deriving our dimensionality reduction method in Section 4. Note that 
the affine invariance of some metrics on the SPD manifold, which has made such metrics popular, 
is a lesser form of this property. In other words, the /-divergences are invariant to a broader set of 
transformations, including affine ones. 

Before concluding fhis section, if is worfh mentioning fhe following relationship. 


Remark 4 The Bhattacharya distance between two distributions is defined as 5B{p\\q) = — ln.{B{p,q)) 
where is the Bhattacharya coefficient defined as 


B{p,q) 


y/p{x)q{x)dx . 


The Bhattacharya coefficient and the Bellinger distance are related through S‘jj{p\\q) = 2 — 
2B{p,q). Interestingly, the Bhattacharya distance between two zero mean multivariate normal 
distribution p = M{0, Sp) and q = M{0, Sg) is proportional to the Stein divergence between 
their corresponding covariance matrices. More specifically, SB{p,q) = 0.5S'(5]p, 5]g) which can 
be proved by direct insertion. The Stein divergence has been successfully used in addressing several 
problems in vision (Cherian et al., 2013; Zhang et ai, 2015). 

2. Image-Sets as PDFs 

We now infroduce our approach fo modeling image-sefs as PDFs. To fhis end, lef be a 

sef of n images, where each Xi G is a feafure vector describing one image in fhe sef. We 
propose fo make use of Kernel Densify Esfimafion (KDE) fo obfain a fine-grained esfimafe p{x) of 
fhe disfribufion p{x) of fhe feafures. This esfimafe can be wriffen as 

Given fwo image-sefs and Eq. 4 provides us wifh fhe means fo esfimafe 

fheir respective PDEs p{x) and q{x). We fhen aim fo compare fhese image-sefs by computing 
fhe sfafisfical disfances befween p{x) and q{x). As discussed in Section 1, we propose fo rely on 
/-divergences fo compare p{x) and q{x). Nofe, however, fhaf fhe infegrals corresponding fo fhe 
Hellinger disfance and fo fhe J-divergence do nof have an analyfic solufion for our KDE represenfa- 
fion. Therefore, below, we derive solutions fo compufe a robusf esfimafe of fhese fwo divergences. 

2.1 Empirical /-Divergences 

Eef us firsl consider fhe case of fhe Hellinger disfance. Given fwo sefs of samples and 

drawn from p{x) and q{x), respectively, direcfly estimating fhe integral of Eq. 2 is nof 
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straightforward. To make this easier, one can rewrite Eq. 2 as 



where Ep{-) denotes the expectation under p. Following the strong law of large numbers, as is 
commonly done in practice, such an expectation can then be estimated as 


^h{p\\q) 


1 / 




\ P{x. 


(p)^ 


(V) 


with p{-) and q{-) obtained by KDE. Unfortunately, such an estimate would in general be different if 
one had chosen to make use of q{x) instead of p{x) to derive the expectation of Eq. 6. This implies 
that the resulting estimate of the Hellinger distance would be asymmetric, and thus poorly-suited to 
our goals. It is now clear that if the realizations of q{x), i.e. are used to estimate 

a different result is attained as in this case we have 5‘jj{p\\q) = Eqil — j and hence 



1 / 




q{x^^^) 


( 8 ) 


While one could compute the average of Eqn. (6) and Eqn. (8) to obtain a less biased and 
symmetric estimation of 6'jj{p\\q), we follow the approach of (Carter, 2009) to obtain a more robust 
estimate of the Hellinger distance. More specifically, we rewrite ^h{p\\(1) as 




pjx) 


p{x) +q{a 


Qjx) 


p{x) + q{x) 


{p{x) + q{x))da 


with 


(ph) =J{ 

Ep(^^/T{x) - Vl - T{x)y + E,(^^/T{x) - Vi - T{x)y , 

p{x) 


T{X) = 


Given our two sets of samples and {xf’Yih^ this allows us to estimate the Hellinger 

distance as 


p{x) + q{x) 


( 9 ) 


( 10 ) 


5i{p\\q) = - E Unx^F'^) -yji- nx^Y^)^ 

j V / 

+ - E - T{x^Y^)\ . (11) 

i \ J 


The benefits of this approach are twofold. First, the resulting estimate is symmetric. Second, 
and maybe more importantly, the denominator of T(-) alleviates the potential instabilities that low 
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probabilities of samples under either qor p would have resulted in by making use of the formulation 
in Eq. 7, or of its counterpart in terms of q. Note that such low probabilities are quite common when 
relying on KDE with high-dimensional data. 

In the case of the Jeffrey divergence, we make use of the same idea as for the Hellinger distance. 
We therefore express the J-divergence in terms of T(-), which, after some derivations, yields 




1 '^p 

-^(2T(a;J^))-l)ln 

^ 7 




nq 

+ -Y,{2T{xf)-\)\u 

’ 7 




( 12 ) 


Our two empirical estimates give us practical ways to evaluate the distance between two image- 
sets represented by their PDEs. Given a training set of m image-sets and a query image-set, match¬ 
ing can then simply be achieved by computing the distance of the query to all training image-sets, 
and choosing the nearest one as matching set. 


3. Kernels on the Statistical Manifold 

In the previous section, we have introduced an approach to comparing the distributions of two 
image-sets using empirical estimates of the Hellinger distance or of the J-divergence. Such an 
approach, however, only allows us to make use of a nearest neighbor classifier. Kernel methods 
(e.g., Kernel Eisher discriminant analysis), however, provide much more powerful tools to perform 
classification, and thus image-set matching. Here, we therefore turn to the question of whether the 
divergences defined in Secfion 1 can generafe valid positive definile (pd) kernels. To answer fhis 
question, lef us firsl define fhe notion of pd kernels. 

Definition 5 (Real-valued Positive Definite Kernels) Let X be a nonempty set. A symmetric func¬ 
tion k : X X X ^ is a positive definite (pd) kernel on X if and only if 

n 

CiCjk{xi,Xj) > 0 (13) 


for any n ^ N, Xi € X and Ci G M. 

Eor the J-divergence, the kernel 

k.j{p, q) = exp{-a6j{p\\q)) (14) 

was introduced in (Moreno et al., 2004), although without a formal proof of positive definiteness. 
Eater, in (Hein and Bousquet, 2005), a conditionally positive definite (cpd) kernel based on a smooth 
version of the J-divergence was derived. To the best of our knowledge, a counter-example that 
disproves the positive definiteness of A:j(-, •) has never been exhibited in the literature. Therefore, 
in our experiments, we assumed that kj{-, •) is pd. Please note that the kernel in Eq.(14) is not a 
Eaplace kernel since 6j denotes the J-divergence not Sj. The reason behind our choice of 6j over 
6j is that the unlike the Hellinger distance, J-divergence is not a metric (this can be shown by a 
counter-example). 
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In the case of the Hellinger distance, a conditionally positive definite (cpd) kernel was studied 
in (Hein and Bousquet, 2005). Here, in contrast, we derive valid pd kernels. More precisely, we 
do not only introduce a single pd kernel, but rather provide a recipe to generate diverse pd kernels 
on the statistical manifold. Our derivations rely on the definition of negative definite kernels given 
below. 

Definition 6 (Real-valued Negative Definite Kernels) Let X be a nonempty set. A symmetric 
function if; : XxX ^ Mis a negative definite (nd) kernel on X if and only ifY17j=i CiCj/c(xj, Xj) < 
0/or any n G N, Xi G X and c* G M with ^11=1 ^ — 0- 

Note that, in contrast to positive definite kernels, an additional constraint of the form 'ffci = 0 is 
required in the negative definite case. Given this definition, we now prove that the Hellinger distance 
is nd. 

Theorem 7 (Negative Definiteness of the Hellinger distance) Lot M. denote the statistical mani¬ 
fold. The Hellinger distance : Af x Af —)• M"*" is negative definite. 

Proof 


E = E / {AmA - fpXfdx 

■,j=t ^ 

N N . N N . 

/ Pi{x)dx 

i=l j=l j=l i=l 

-2 ^ acj / JPi{x)pj{x)dx 
i,j=i 

N N - 

=-2 / '^Cis/pi{x)'^CjJpj{x)dx 
^ i i 

N 

= -2 / II ^^Ciy/pi{x)\\^dx < 0 , 


where the terms in the second line have disappeared due to the constraints ^ • a = 0, resp. 
Cj = 0, and to the fact that the integrals have value 1 for any i, resp. j. ■ 


We then make use of the following theorem, which originated from the work of I. J. Schoenberg 
(1903-1990). 

Theorem 8 (Theorem 2.3 in Chapter 3 of (Berg et ah, 1984)) Let p, be a probability measure on 
the half line M"*“ and 0 < tdp{t) < oo. Let be the Laplace transform of p, i.e., L^(s) = 
/q e~^^dp{t), s G C. Then, C^lPf) is positive definite for all f3 > 0 if and only if f : X x X ^ 
is negative definite. 

Theorem 8 provides a general recipe to create pd kernels. In particular, here, we focus on the 
Gaussian and the Laplace kernels, which have proven powerful in Euclidean space. The Gaussian 
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kernel can be obtained by choosing fi{t) = 6{t — 1) in Theorem 8, where S denotes the Dirac 
function. On the statistical manifold, this kernel can then be written as 

kH{p,q) = exp(-cr5|^(p,g)), CJ > 0 . (15) 

To derive the Laplace kernel on the statistical manifold, we must further rely on the following 
theorem. 

Theorem 9 (Corollary 2.10 in Chapter 3 of (Berg et al., 1984)) If ijj : X x X ^ M. is negative 
definite and satisfies £c) ^ 0 then so is ip^for 0 < a < 1. 

As a consequence, by choosing and a = 1/2 in Theorem 9, we have that (5 h(-, •) is nd. 

Then, applying Theorem 8 with 6h{-, •) and /r(t) = — 1) lets us derive the Laplace kernel on 

the statistical manifold 


klip, q) = ex.p{-a6H{p, q)), a > 0 . (16) 

Remark 10 The Hellinger distance can be thought of as the chordal distance between points on an 
infinite-dimensional unit hyper-sphere. More specifically, the square root function is a diffeomor- 
phism between the statistical manifold and the unit hyper-sphere. In (Srivastava et ai, 2007), this 
was exploited to estimate the distance between discretized PDFs as the geodesic distance on the 
corresponding (finite-dimensional) hyper-sphere. Such a distance, however, cannot induce a valid 
positive definite Gaussian kernel, since the Gaussian kernel produced by the geodesic distance on 
a Riemannian manifold is not positive definite unless the manifold is flat (Feragen et ai, 2015). In 
contrast, as shown above, our divergences yield valid positive definite kernels, which allow us to 
exploit more sophisticated classification methods. 

Remark 11 Note that the discussion above proves the positive definiteness of kernels defined with 
the exact Hellinger distance, and does not necessarily extend to its empirical estimate. However, 
since the strong law of large numbers guarantees convergence of our empirical estimate to the true 
distance, given sufficiently many samples, the resulting empirical kernels will also be pd. 

In our experiments, we used the kernels derived above to perform kernel discriminant analy¬ 
sis. Image-set matching was then achieved by using the Euclidean distance in the resulting low¬ 
dimensional latent space. 

4. /-Divergences for Dimensionality Reduction 

The methods described in Sections 2 and 3 directly compare the distributions of the original fea¬ 
tures of each image-set. As mentioned earlier, with high-dimensional features that are common in 
computer vision, KDE may produce very sparse PDEs (i.e., PDEs that are strongly peaked around 
the samples and zero everywhere else), which may be less reliable to compare. To address this 
issue, here, we propose to learn a mapping of the features to a low-dimensional space, such that the 
/-divergences in the resulting space reflect some interesting properties of the data. 

As will be shown shortly, we will formulate the dimensionality reduction as a problem on the 
Grassmann manifolds. The use of Grassmann and Stiefel manifolds for dimensionality reduction is 
an emerging topic in machine learning. Two notable examples are robust PGA using Grassmannian 


7 


by Hauberg et al. (Hauberg et al., 2014) and linear dimensionality reduction using Stiefel manifolds 
by Cunningham and Ghahramani (Cunningham and Ghahramani, 2015). 

Focusing on the supervised scenario, we search for a latent space where two image-sets are 
close to each other (according to the /-divergence) if they belong to the same class and far apart 
if they don’t. That is, given a set of image-sets X = {Xi, • • • , Xm}, where each image-set 
Xi = G M^, our goal is to find a transformation W G such that the /- 

divergences between the mapped image-sets | { W'^xf^ }?=!} . encode some interesting structure 

of the data. Here, we represent this structure via an affinity function a(Xj, Xj) that encodes pair¬ 
wise relationships between the image-sets. This affinity function will be described in Section 4.1. 

Since we aim for the /-divergences to reflect this affinity measure, we can write a cost function 
of the form 

= Y, , (17) 

id 

where <5 : Af x Af —is either or Sj{■,■), and where we sum over all pairs of image- 

sets. To avoid possible degeneracies when minimizing this cost function w.r.t. W, and follow¬ 
ing common practice in dimensionality reduction, we enforce the solution to be orthogonal, i.e., 
W'^W = Id- This allows us to write dimensionality reduction as the optimization problem 

W* = argmin C{W) (18) 

w 

s.t. W^W = Id - 

Below, we show that, for both divergences of interest, i.e., the Hellinger distance and the J-divergence, 
(21) is a minimization problem on a Grassmann manifold. 

The Grassmann manifold Q{d, D) is the space of d-dimensional subspaces in and corre¬ 
sponds to as a quotient space of the Stiefel manifold {i.e., the space of d-dimensional frames in 
M^, or in other words orthogonal D x d matrices) (Edelman et al., 1999). More specifically, the 
points on the Stiefel manifold that form a basis of the same subspace are identified with a single 
point on the Grassmann manifold. As such, a minimization problem with orthogonality constraint 
W^W = Id is a problem on the Grassmannian iff the cost of the problem is invariant to the choice 
of basis of the subspace spanned by W. Mathematically, minvi/£(VF) with W'^W = 1^ is a 
problem on the Grassmannian iff C{WR) = C{W), \/R G 0{d), where 0{d) denotes the group 
of d X d orthogonal matrices. Since transformations in Od are bijections, the invariance property of 
Theorem 3 directly shows that the cost function of (21) is invariant to the choice of basis. In other 
words, (21) can be solved as an unconstrained minimization problem on Q{d, D). 

In practice, to solve (21) on Q{d,D), we make use of Newton-type methods {e.g., the conju¬ 
gate gradient method). These methods inherently require the gradient of C{W). On Q{d, D), the 
gradient^ can be expressed as 

VwC{W) = {Id-WW^) grad C{W), (19) 


3. On an abstract Riemannian manifold At, the gradient of a smooth real function / at a point x € M, denoted hy 
V f{x), is the element of TcAf satisfying (^f{x),({}x ~ Df{x)[C], VC G T^^M, where Df{x)[l^] denotes the 
directional derivative of / at a: along direction C- 



where grad£(VF) is the D x d matrix of partial derivatives of C{W) with respect to the elements 
of W, i.e., 

|grad£(W)l., = 

The detailed derivations of giad C{W) for our two /-divergences are provided in the appendix. 

Remark 12 A weaker form of constraints on W could be achieved by exploiting the geometry of 
the oblique manifold (Absil and Gallivan, 2006). The oblique manifold OB{d, D) is a submanifold 
whose points are rank d matrices W subject to 

ddiag{W^W) = Id , 

where ddiag(A) denotes the diagonal matrix whose elements are those of the diagonal of A. Sim¬ 
ilarly to the Stiefel and Grassmann manifolds, a point on the oblique manifold is represented by a 
tall D X d matrix with unit norm columns. By contrast, however, these columns are not necessarily 
orthogonal. We found empirically that relying on these weaker constraints does not yield as good 
accuracy for image-set matching as our orthogonality constraints. This appears to be due to the 
fact that, when two columns of W become very similar, but still satisfy the rank constraint, the 
cost decreases, but the resulting latent space is less informative. This illustrates the importance of 
orthogonal columns, which favor diversity in the latent space. 

4.1 Affinity Measure 

As mentioned above, we propose to exploit supervised data to define the affinity measure used in 
the cost function of Eq. 17. Note that unsupervised approaches are also possible, for instance to find 
a mapping thaf preserves fhe closeness of pairs of image-sefs. 

In our supervised scenario, lef yi denote the class label of image-set Xi, with 1 < yi < C . We 
define fhe affinity between two sets Xi and Xj with labels yi and yj, respectively, as 

a{Xi, Xj) = gjj]{Xi,Xj) — gf,{Xi, Xj) , (20) 

where g.^ and gi, encode a notion of within-class similarity and between-class similarity, respec¬ 
tively. These similarities can be expressed as 

if Xi E N^iXf or Xj E N^{Xi) 

otherwise 

ifXiGNbiXj) or XjGNbiXi) 
otherwise 

where X^(Xj) is the set of nearest neighbors of Xi (according to the /-divergence) that share 
the same label as yi, and Ni,{Xi) contains the U), nearest neighbors of Xi having different labels. 
In our experiments, we defined as the minimum number of points in a class and found Ub < 
by cross-validation. 

Before moving to the next section, we would like to take a detour and discuss a couple of 
observations/insights on the dimensionality reduction algorithm. Examples below are all taken 
from the face recognition experiment on YouTube Celebrity (YTC) dataset (Kim et ah, 2008) (the 
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Figure 1: The behavior of the cost function depicted in Eq.(17) for a few iterations of the conjugate gradient 
method on Grassmannian. 


very first experiment in § 5). Fig. 1 shows the behavior of the cost function depicted in Eq.(17) for 
a few iterations of the conjugate gradient method on Grassmannian. In practice, we found that the 
algorithm would converge in most cases in less than 25 iterations. On a related note, each iteration 
of the conjugate gradient on Grassmannian for the aforementioned case took near 40 seconds on a 
quad core desktop machine. 

Fig. 2 shows the affinity matrix before and after training for eight representative classes of 
YTC dataset. Bright and dark regions represent high and low similarities, respectively. In the ideal 
case, the affinity matrix should compose of eight 3x3 sub-blocks sitting on the diagonal of the 
matrix. Fig. 2 suggests that after training, each class has become more compact and hence more 
discriminative. On top of this, computing distances in low-dimensional spaces is much faster than 
high-dimensional ones. For example in the case of YTC, computing 10,000 distances in the high¬ 
dimensional space took 100 seconds while after dimensionality reduction this time is reduced to 25 
seconds. 

5. Experimental Evaluation 

We now evaluate the algorithms introduced in the previous sections on diverse standard image- 
set matching problems. In particular, for our kernel-based approach, we make use of the kernel 
Fisher Discriminant Analysis (kFDA) algorithm. kFDA is a kernel-based approach to learning a 
discriminative latent space. Classification in the resulting latent space is then performed with a 
Nearest Neighbor (NN) classifier based on the Euclidean distance. In all our experiments, the 
dimensionality of the latent space, for both kEDA and our dimensionality reduction scheme, was 
determined by cross-validation. The kernel bandwidth, i.e., a in Eq.(14), Eq.(15) and Eq.(16) was 
chosen from {0.001, 0.005, 0.01, 0.05, 0.1,0.5,1}. In the remainder of this section, we refer to our 
different algorithms as 

NN-H: NN classifier based on the Hellinger distance. 
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Figure 2: The affinity matrix before and after training for eight representative classes of YTC dataset. 


NN-J: NN classifier based on the J-divergence. 
kFDA-HG: kFDA with the Hellinger distance (Eq. (15)). 
kFDA-HL: kFDA with the Hellinger distance (Eq. (16)). 
kFDA-J: kEDA with the J-divergence (Eq. (14)). 

NN-H-DR: NN classifier based on the Hellinger distance after dimensionality reduction. 

NN-J-DR: NN classifier based on the J-divergence after dimensionality reduction. 

Since, as mentioned before, our approach was motivated by techniques that exploit geometri¬ 
cal structures, such as SPD or Grassmann manifolds, which have proven effective for image-set 
matching, we compare our results against two such techniques. In particular, we make use of 
Grassmann Discriminant Analysis (GDA) (Hamm and Lee, 2008) and of Covariance Discrimi¬ 
native Learning (CDL) (Wang et al., 2012) as baseline algorithms, both which, as us, employ 
kEDA to match image-sets. Eor CDL, the kernel function ks : 5”_,_ x 5”_,_ —)• M is given by 
ks{A, B) = Tr(log(A)^ log(S)), where log is the principal matrix logarithm. Eor GDA, the ker¬ 
nel function kg : Q{p, n) x0{p,n) —)• M is the projection kernel defined as kg{A, B) = || 

In all our experiments, we used the same image features for our approach and for GDA and CDL. 
Einally, for each dataset, we also report the best result found in the literature, and refer to this result 
as State-of-the-art. 

5.1 Video-Based Face Recognition 

For the task of image-set-based face recognition, we used the YTC (Kim et al., 2008) and COX (Huang 
et al., 2013) datasets (see Fig. 3 for samples from YTC). The YTC dataset contains 1910 video clips 
of 47 subjects. The COX dataset includes 1,000 subjects, each captured by three cameras {i.e., 
3,000 videos in total). For the YTC dataset, we used face regions extracted from the videos, re¬ 
sized to 64 X 64, and described each region with a histogram of Local Binary Patterns (EBP) (Ojala 
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Figure 3: Samples from YouTube celebrity. 


et al., 2002). For the COX dataset, following (Huang et al., 2015), we used histograms of equalized 
intensity values as features^. 

For the YTC dataset, following the standard practice (Lu et al., 2013), 3 videos from each 
person were randomly chosen as training/gallery data, and the query set contained 6 randomly 
chosen videos from each subject. The process of random selection was repeated 5 times. For the 
COX dataset, we followed the test protocol of (Huang et al., 2015), and used videos of 100 randomly 
chosen subjects as training data, used only for dimensionality reduction. Then, 100 subjects were 
randomly chosen to form the gallery/probe sets for 6 different experiments. For each experiment, 
the camera number determines the gallery and probe sets. For example, COX 12 refers to the test 
scenario where videos captured by Caml and Cam2 are used as gallery and probe set, respectively. 
The random selection of training and gallery/probe sets was repeated 10 times. 

Table 1 shows the average accuracies of all methods on the YTC and COX datasets. For these 
datasets, the State-of-the-art baselines correspond to the metric learning approach of (Lu et al., 
2013) and the hybrid solution of (Huang et al., 2015), respectively. As far as geometrical methods 
are concerned, the results evidence that making use of the statistical manifold yields superior results 
compared to the Grassmann and SPD manifolds. This is even true for the direct NN classifiers based 
on our divergences, which are further improved by dimensionality reduction. This, we believe, 
demonstrates the benefits of relying on more accurate PDF representations of each image-set (i.e., 
KDE in our case versus single Gaussians for CDL and GDA). Furthermore, on both datasets, our 
algorithms either match or outperform the state-of-the-art. The exceptions are COX12 and COX32, 
which could be attributed to the more sophisticated classification scheme used in (Huang et al., 
2015). Note that, as acknowledged in (Huang et al., 2015), the hybrid method does not scale up 
well to large datasets. 

5.2 Dynamic Texture Recognition 

As a second task, we consider the problem of dynamic texture recognition using the DynTex-i-i- 
dataset (Ghanem and Ahuja, 2010). Dynamic textures are videos of moving scenes that exhibit 
certain stationary properties in the time domain (Ghanem and Ahuja, 2010). Such videos are perva¬ 
sive in various environments, such as sequences of rivers, clouds, fire, swarms of birds and crowds. 
DynTex-i-i- (Ghanem and Ahuja, 2010) is comprised of 36 classes, each of which contains 100 se¬ 
quences .We split the dataset into training and testing sets by randomly assigning half of the videos 
of each class to the training set and using the rest as query data. We used the LBP-TOP (Zhao and 

4. We found these features to be superior to LBP features on COX. 
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Table 1: Average recognition rates on the YTC and COX datasets. 


Method 

YTC 

COX12 

COX13 

COX23 

COX21 

COX31 

COX32 

GDA (Hamm and Lee, 2008) 

66.2 ± 9.7 

68.8 

77.7 

71.6 

66.0 

76.1 

74.8 

CDL (Wang et al., 2012) 

70.9 ± 3.2 

78.4 

85.3 

79.7 

75.6 

85.8 

81.9 

State-of-the-art 

78.2 

95.1 

96.3 

94.2 

92.3 

95.4 

94.5 

NN-H 

77.3 ± 4.5 

61.7 ± 4.2 

69.2 ± 4.0 

63.5 ± 2.3 

66.6 ± 5.0 

64.2 ± 4.2 

64.0 ± 3.5 

NN-J 

76.7 ± 5.1 

64.7 ± 4.1 

69.5 ± 3.3 

63.0 ± 2.4 

65.5 ± 5.1 

70.0 ± 3.9 

63.3 ± 3.6 

kFDA-HG 

78.9 ± 3.4 

90.8 ± 3.0 

96.0 ± 1.7 

92.9 ± 1.9 

92.5 ± 2.4 

95.8 ± 1.7 

93.4 ± 1.7 

kFDA-HL 

78.6 ± 4.7 

92.4 ± 2.1 

96.8 ±0.7 

94.7 ± 1.1 

92.2 ± 1.1 

96.6 ± 0.8 

93.7 ± 1.3 

kFDA-J 

79.4 ± 3.8 

91.5 ± 3.0 

95.9 ± 1.7 

93.0 ± 2.0 

92.5 ± 2.5 

95.6 ± 1.5 

93.5 ± 1.6 

NN-H-DR 

78.3 ± 3.7 

71.1 ± 4.0 

83.6 ± 3.5 

77.1 ± 4.1 

76.6 ± 3.6 

76.4 ± 4.5 

77.1 ± 3.5 

NN-J-DR 

79.3 ± 3.6 

72.3 ± 2.9 

82.6 ± 3.3 

75.6 ± 3.1 

73.2 ± 3.1 

81.8 ± 2.7 

70.4 ± 3.8 




Figure 4: Representative examples of the three classes (light, medium, and heavy) in the UCSD traffic video 
dataset (Chan and Vasconcelos, 2005). 


Table 2: Average recognition rates on the DynTex++, UCSD traffic and Maryland (abbreviated as ML) scene 
datasets. 


Method 

DynTex-i-+ 

Traffic 

ML-LOO 

ML 

GDA (Hamm and Lee, 2008) 

89.9 ±0.6 

92.5 ± 2.6 

81.5 

70.3 ±5.2 

CDL (Wang et al., 2012) 

89.0 ±0.9 

91.7± 1.9 

86.5 

76.7 ±7.8 

State-of-the-art 

93.8 

95.6 

77.7 

NA 

NN-H 

91.6 ±0.7 

91.3 ±4.2 

76.9 

71.2 ±3.1 

NN-J 

91.4 ±0.7 

91.0 ±4.5 

77.7 

71.4 ±3.0 

kFDA-HG 

94.7 ±0.4 

96.1 ± 1.5 

85.4 

78.1 ±4.4 

kFDA-HL 

94.9 ±0.7 

96.5 ± 1.5 

87.7 

79.0 ±3.1 

kFDA-J 

95.2 ±0.6 

97.3 ± 1.4 

86.9 

77.8 ±5.3 

NN-H-DR 

92.3 ±0.5 

94.9 ± 2.9 

80.8 

79.7 ±4.5 

NN-J-DR 

91.9 ±0.5 

95.6 ± 1.5 

82.3 

80.2 ±3.7 


Pietikainen, 2007) approach to represent each video sequence. Table 2 shows the average accura¬ 
cies for 10 random splits. To the best of our knowledge, (Ramirez Rivera and Chae, 2015) reported 
the highest accuracy on this dataset. Our kFDA-J approach yields a 1.4% improvement over this 
state-of-the-art. As before, we can observe a large gap between our approach and GDA or CDL. 

5.3 Scene Classification 

For scene classification, we employed the UCSD traffic datasef (Chan and Vasconcelos, 2005) and 
the Maryland (Shroff et al., 2010) scene recognition dataset. The UCSD traffic dafasef contains 254 
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Figure 5: Representative examples of three classes of the Maryland scene dataset (Shroff et al., 2010). From 
left to right; iceberg collapsing, tornado, and volcano eruption. 


video sequences of highway traffic of varying patterns {i.e., light, medium and heavy) in various 
weather conditions {e.g., cloudy, raining, sunny). See Fig. 4 for examples. We used HoG fea¬ 
tures (Dalai and Triggs, 2005) to describe each frame. Our experiments were performed using the 
splits provided with the dataset (Chan and Vasconcelos, 2005). We report the average accuracies 
over these splits in Table 2. The state-of-the-art results were reported in (Ravichandran et ah, 2011). 
Once again, we see that our kernel-based and dimensionality reduction algorithms comfortably out¬ 
perform GDA and CDL, as well as the previous state-of-the-art. 

As a last experiment, we used the Maryland dataset, which contains 13 different classes of dy¬ 
namic scenes, such as avalanches and tornados. See Fig. 5 for examples. This dataset is more 
challenging, and we observed that handcrafted features, such as LBP or HoG, do not provide suffi¬ 
ciently discriminative representations. Therefore, we used the last layer of the CNN trained in (Zhou 
et ah, 2014) as frame descriptors. We then reduced the dimensionality of the CNN output to 400 
using PGA. We first employed the standard Leave-One-Out (LOO) test protocol. Furthermore, we 
also evaluated the methods on 10 different training/query partitions obtained by randomly choos¬ 
ing 70% of the dataset for training and the remaining 30% for testing. The average classification 
accuracies for both protocols are reported in Table 2. Note that no state-of-the-art results has been 
reported in the literature on our second test protocol. Our approach outperforms the state-of-the-art 
result of Feichtenhofer (Feichtenhofer et ah, 2014) by more than 7%. While this may be attributed 
in part to the CNN features, note that our approach still outperforms GDA and CDL based on the 
same features. 


6. Conclusions and Future Work 

We have introduced a novel framework to model and compare image-sets. Specifically, we have 
made use of KDE to represent an image-set with its PDF, and have proposed practical solutions 
to employ /-divergences for image-set matching, including empirical estimates of /-divergences, 
valid pd kernels on the statistical manifold and a supervised dimensionality reduction algorithm 
inherently accounting for /-divergences in the resulting latent space. In the future, we plan to 
extend our learning scheme to the unsupervised and semi-supervised scenarios. Furthermore, we 
intend to study the use and effectiveness of other divergences to tackle the problem of image-set 
matching. 
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Appendix A. Gradient for Dimensionality Reduction 


The cost function to be minimized by our dimensionality reduction approach, given in Eq. 15 and 
Eq. 16 of the main paper, can be expressed as 


W* = argmin '^a{Xi,Xj) ■ 5{W'^Xi,W^Xj) (21) 

s.t. W^W = ld, 


where a(-, •) is an affinity measure that does not depend on W, 5 : A4 x A4 ^ is either the 
Hellinger distance (•,•), or the Jeffrey divergence 5j(-, •), and the sum is over all pairs of training 
image-sets. 

As shown in Eq. 17 of the main paper, to minimize this cost function by a gradient descent 
method on the Grassmannian, we need to compute the partial derivatives ■)/dW or dSj{-, ■)/dW. 

Since both and are expressed in terms of T(-) (defined in Eq. 9 of fhe main paper), 

we sfarf by deriving dTjdW. 


d 


T{W^x) 


d p{W'^x) 

^ p{W^x) + q{W^x) 


{p{W^x) + q{W^x)y 


q{W^x 


dp{W'^x) 


— p{W'^x) 


dq{W'^ x)\ 

) ' 


( 22 ) 


Erom fhe definifion of fhe empirical esfimafe of a disfribufion given in Eq. 4 of fhe main paper, fhe 
derivative of dp{W'^x)/dW (and similarly for dq{W'^x) / dW) can be written as 


^ p{W^x) = ^ 


dW 




-1 


npY^det(27rSp) ^ 


dW |npY^det(27rSp) ^ ^ 

( / rp 

< exp I — - — x^^y (x — X 


x-x'f>] WT.-^W'^ix-x 


Sp) 


Sp) 


X — xy ]\x — X 


Sp) 


WT.-\ 


(23) 
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According to the definition of the Hellinger distance given by Eq. 8 and Eq. 10 of the main 
paper, this lets us derive ■)/dW as 


d 

Ww 


4(plk) = (v^ 


T{w'^x) - Jl-T{W' 




T{W'^x) - J\-T{W‘ 




2Jt{W^x) 2Jl-T{W" 


dW 


T{W'^ 


EE,\^2(^^T{W^x)- ^l-T{W^x)^ 


T{W'^x) 2^l-T{W^x) 


T^^ / dw 


T{W'^x) 


= Er, 


+ E„ 


2T{W^x) - 1 


d 


\ T{W^x)(l-T{W^x 


2T{W^x) - 1 


dW 

d 


T{W'^ 


, \ T{W'^x)(l-T{W'^x) 


dW 


T{W^x) 


= -E 

n .. ^ 




2T{W'^x'f'>) - 1 


d 


^ T{W^x[P^)(l-T{W^x[^^)'^ 


2T{W^x[‘’'>) - 1 


dW 

d 


T{W^x 


T^(P)\ 


^ T{W^x[‘‘^)(l-T{W^x[‘‘'>)'^ 


dW 


T{W 




(24) 


which depends on dT/dW obtained from Eq. 22 and Eq. 23 above. 

Similarly, for the Jeffrey divergence defined inEq. 11 of the main paper, we can obtain d6j{-, ■)/dW 
as 
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Ww 


^APh) = {^T{W^x) - l) log 
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Using either of these derivatives lets us eompute the gradient of the objeetive funetion on the 

Grassmann manifold from Eq. 17 of the main paper, and thus ultimately learn the mapping W to a 

lower-dimensional subspaee where the /-divergenee refleets the ehosen affinity measure. 

References 

P-A Absil and KA Gallivan. Joint diagonalization on the oblique manifold for independent eompo- 
nent analysis. Teehnieal report, Universite eatholique de Louvain, 2006. 

Syed Mumtaz Ali and Samuel D Silvey. A general elass of eoeffieients of divergenee of one distri¬ 
bution from another. Journal of the Royal Statistical Society. Series B (Methodological), 1966. 

S-I Amari and A Ciehoeki. Information geometry of divergenee funetions. Bulletin of the Polish 
Academy of Sciences: Technical Sciences, 58(1), 2010. 

M. Baktashmotlagh, M.T. Harandi, B.C. Lovell, and M. Salzmann. Domain adaptation on the 
statistieal manifold. In CVPR, 2014. 

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, 1984. 

Kevin Miehael Carter. Dimensionality reduction on statistical manifolds. PhD thesis. University of 
Miehigan, 2009. 

Antoni B. Chan and Nuno Vaseoneelos. Probabilistie kernels for the elassifieation of auto-regressive 
visual proeesses. In CVPR, 2005. 

A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-bregman logdet divergenee with 
applieation to effieient similarity seareh for eovarianee matriees. Pattern Analysis and Machine 
Intelligence, IEEE Transactions on, 35(9), 2013. 

John P Cunningham and Zoubin Ghahramani. Linear dimensionality reduetion: Survey, insights, 
and generalizations. JMLR, pages -, 2015. 

Navneet Dalai and Bill Triggs. Histograms of oriented gradients for human detention. In CVPR, 
2005. 

Alan Edelman, Tomas A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonal¬ 
ity eonstraints. SIAM J. Matrix Anal. Appl, 1999. 

C. Leiehtenhofer, A. Pinz, and R.P Wildes. Bags of spaeetime energies for dynamie seene reeogni- 
tion. In CVPR, 2014. 

Aasa Leragen, Lraneois Lauze, and Soren Hauberg. Geodesie exponential kernels: When eurvature 
and linearity eonfliet. In CVPR, pages 3032-3042, June 2015. 

Bernard Ghanem and Narendra Abuja. Maximum margin distanee learning for dynamie texture 
reeognition. In ECCV, 2010. 

Jihun Hamm and Daniel D Lee. Grassmann diseriminant analysis: a unifying view on subspaee- 
based learning. In ICML, 2008. 


17 



Soren Hauberg, Aasa Feragen, and Michael J Black. Grassmann averages for scalable robust pea. 
In CVPR, pages 3810-3817, 2014. 

Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability 
measures. In AISTATS, 2005. 

Zhiwu Huang, Shiguang Shan, Haihong Zhang, Shihong Lao, Alifu Kuerban, and Xilin Chen. 
Benchmarking still-to-video face recognition via partial and local linear discriminant analysis 
on cox-s2v dataset. In ACCF, 2013. 

Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Face recognition on large-scale 
video in the wild with hybrid euclidean-and-riemannian metric learning. Pattern Recognition 
(PR), 2015. 

Minyoung Kim, Sanjiv Kumar, Vladimir Pavlovic, and Henry Rowley. Face tracking and recogni¬ 
tion with visual constraints in real-world videos. In CVPR, 2008. 

Jiwen Lu, Gang Wang, and R Moulin. Image set classification using holistic multiple order statistics 
features and localized multi-kernel metric learning. In ICCV, 2013. 

Pedro J. Moreno, Purdy P. Ho, and Nuno Vasconcelos. A kullback-leibler divergence based kernel 
for svm classification in multimedia applications. In NIPS. 2004. 

Timo Ojala, Matti Pietikainen, and Topi Maenpaa. Multiresolution gray-scale and rotation invariant 
texture classification with local binary patterns. TPAMI, 24, 2002. 

Yu Qiao and Nobuaki Minematsu. A study on invariance of-divergence and its application to speech 
recognition. IEEE Trans, on Signal Processing, 58(7), 2010. 

C Radhakrishna Rao. Information and accuracy attainable in the estimation of statistical parameters. 
Bulletin of the Calcutta Mathematical Society, 37(3), 1945. 

A. Ramirez Rivera and O. Chae. Spatiotemporal directional number transitional graph for dynamic 
texture recognition. TPAMI, 2015. 

Avinash Ravichandran, Paolo Favaro, and Rene Vidal. A unified approach to segmentation and 
categorization of dynamic textures. In ACCV. 2011. 

Nitesh Shroff, Pavan Turaga, and Rama Chellappa. Moving vistas: Exploiting motion for describing 
scenes. In CVPR, 2010. 

A. Srivastava, I. Jermyn, and S. Joshi. Riemannian analysis of probability density functions with 
applications in vision. In CVPR, 2007. 

Ruiping Wang, Huimin Guo, Larry S Davis, and Qionghai Dai. Covariance discriminative learning: 
A natural and efficient approach to image set classification. In CVPR, 2012. 

SP Zhang, S Kasiviswanathan, PC Yuen, and M Harandi. Online dictionary learning on symmetric 
positive definite manifolds with vision applications. In The AAAI Conf. Artificial Intelligence 
(AAAI), 2015. 


18 



Guoying Zhao and Matti Pietikainen. Dynamic texture recognition using local binary patterns with 
an application to facial expressions. TPAMI, 29(6), 2007. 

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep 
features for scene recognition using places database. In NIPS, 2014. 


19 



